Introduction

The aim of this project is to use machine learning to predict whether or not a patient will have a stroke. The dataset was pulled from Kaggle and was originally collected by the World Health Organization (WHO). Let's import the necessary libraries we will be using from beginning to end.

library(tidymodels)
library(tidyverse)
library(glmnet)
library(modeldata)
library(ggthemes)
library(janitor)
library(dplyr)
library(naniar)
library(caTools)
library(vip)
library(skimr)
library(recipes)
library(themis)
library(corrplot)
tidymodels_prefer()

Why do we need to predict stroke?

According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths. This dataset consists of different parameters that can contribute to a patient being prone to stroke. I picked this dataset because of my interest in medical AI. A stroke, also known as a cerebrovascular accident (CVA), is a medical emergency that occurs when the blood supply to the brain is interrupted or reduced, depriving brain cells of oxygen and nutrients. This can cause brain cells to die or become damaged, leading to various neurological deficits, such as paralysis, difficulty speaking, or cognitive impairment. The main types of stroke are ischemic and hemorrhagic.

Ischemic strokes occur when a blood vessel supplying blood to the brain is blocked, usually by a blood clot. This is the most common type of stroke, accounting for around 87% of all cases. The most common causes of ischemic stroke include atherosclerosis (the buildup of plaque in the arteries), heart disease, high blood pressure, diabetes, and smoking.

Hemorrhagic strokes occur when a blood vessel in the brain ruptures, causing bleeding in or around the brain. This type of stroke is less common, accounting for around 13% of all cases. The most common causes of hemorrhagic stroke include high blood pressure, brain aneurysms (weak spots in the blood vessel wall), and arteriovenous malformations (AVMs), which are abnormal tangles of blood vessels in the brain.

Other risk factors that can increase the likelihood of stroke include age (the risk of stroke increases with age), family history of stroke, previous stroke or transient ischemic attack (TIA), obesity, sedentary lifestyle, alcohol abuse, and use of certain medications, such as birth control pills and blood thinners.

Most of the risk factors that increase the chances of stroke in humans are present in our dataset.

(Image source: Dana Foundation.) Now that we understand what “stroke” means, let's start!

Project Workflow

In order to construct an efficient binary classification model, the following procedure will be used. Our first step is to investigate the dataset through visualization so that we can comprehend each variable's coverage in the data. Second, we will clean up the data by identifying missing values and inconclusive terms that could influence our prediction. The variables will then be used to predict whether or not a patient will experience a stroke. We will run a training and test split on the data, develop a recipe, and set the folds for the 10-fold cross-validation that we will implement. To model the training data, we will utilize binary classification models such as Logistic Regression, Boosted Trees, Random Forest, and K-Nearest Neighbors. Once the findings of each model have been acquired, they will be applied to the test set to measure their accuracy in predicting stroke. In addition, we will visually represent the metrics and confusion matrix of the best model for the test dataset.

Exploratory Data Analysis

Before we proceed to the data modelling, we need to deal with data quality issues by manipulating the data, dealing with missing values, and exploring relationships. The dataset contains 5110 rows and 12 columns and consists of the following variables:

  • id: unique identifier

  • gender: “Male”, “Female” or “Other”

  • age: age of the patient

  • hypertension: 0 if the patient doesn’t have hypertension, 1 if the patient has hypertension

  • heart_disease: 0 if the patient doesn’t have any heart disease, 1 if the patient has a heart disease

  • ever_married: “No” or “Yes”

  • work_type: “children”, “Govt_job”, “Never_worked”, “Private” or “Self-employed”

  • residence_type: “Rural” or “Urban”

  • avg_glucose_level: average glucose level in blood; less than 5.6 mmol/L is normal, 5.6 to 6.9 mmol/L is considered prediabetes, and 7 mmol/L or above means the patient has diabetes

  • bmi: body mass index; below 18.5: underweight; 18.5 to 24.9: normal, healthy weight; 25 to 29.9: overweight; 30 or higher: obese

  • smoking_status: “formerly smoked”, “never smoked”, “smokes” or “Unknown” (“Unknown” means the information is unavailable for this patient)

  • stroke: 1 if the patient had a stroke or 0 if not

Reading Dataset

data<-read.csv('/Users/iderasalami/Documents/PSTAT 231/FINAL PROJECT/stroke-data.csv',header=TRUE)%>% #reading the csv file
  clean_names() # clean  the column header 
dim(data)
[1] 5110   12

We have 5110 observations and 12 columns.

data <- data %>%
  mutate(smoking_status = factor(smoking_status),
         stroke = factor(ifelse(stroke == 1, "yes", "no"), levels = c("yes", "no"))) %>% 
  select(-id)

We converted smoking_status to a factor and recoded stroke as a factor with levels “yes” and “no”. We also removed the id column, since it does not contribute to stroke risk.

sum(is.na(data))
[1] 0

From the result above, bmi stores its missing values as the string “N/A”, which makes it impossible to see the true sum of NA values. I will replace “N/A” with NA to reveal the true number of missing values.

data <- data %>%
  replace_with_na_all(condition = ~.x == "N/A")

We replaced the “N/A” strings with NA so we can work with the missing values easily.

data <- data %>%
  mutate(across(c(heart_disease, hypertension, gender, work_type,
                  residence_type, ever_married), as.factor), # convert to factor variables
         bmi = as.double(bmi)) # bmi was read as character because of the "N/A" strings

Visual EDA

Let's visualize the variables and explore the relationships between the parameters.

Gender distribution

data%>%
  ggplot(aes(gender))+
  geom_bar(color = "black",fill="blue")+
  theme_bw()

There are more female records than male. We also have a few “Other” samples, which we will remove in a later section of this project.

Work type

data%>%
  ggplot(aes(work_type))+
  geom_bar(fill='blue')+
  theme_bw()

The number of patients who work in the private sector is higher than in the other categories.

Smoking Status

data%>%
  ggplot(aes(smoking_status))+
  geom_bar(fill='red')+
  theme_bw()

We have “Unknown” in the smoking status, which raises questions: who are the “Unknown” respondents, and was this recorded for adults or for children? We will deal with this later in the project.

BMI / Glucose / Age

ggplot(data, aes(avg_glucose_level, bmi)) +
  geom_point(aes(color = age), alpha = 0.6, size = 1) +
  xlab("avg glucose level")

Apart from bmi having missing values, it also contains outliers. From the visualization, several patients appear to have much higher bmi values, far off from the other points.

Relationship between Target and Other Variables

Residence type

data %>% 
  dplyr::select(stroke,residence_type) %>%
  group_by(stroke, residence_type) %>%
  summarise(nCount = n()) %>%
  mutate(percent = nCount/sum(nCount) * 100) %>% 
  
  ggplot(aes(x = stroke, y = percent, fill = residence_type)) +
  geom_bar(stat = 'identity')+
  labs(
    x = 'Stroke',
    y = 'Percent (%)'
  ) +
  theme(
    panel.border = element_blank(),
    panel.grid = element_blank()
  ) +
  geom_text(aes(label = paste(round(percent, 2), '%')), position = position_stack(vjust = 0.5))

From the diagram, about 54% of the patients with stroke live in urban areas and about 45% live in rural areas, so slightly more stroke patients come from urban areas. It is tempting to attribute this to urban life, with its daily stress, pollution, and everything else attached to it, though the difference is small.

Work Type

data %>% 
  dplyr::select(stroke,work_type) %>%
  group_by(stroke, work_type) %>%
  summarise(nCount = n()) %>%
  mutate(percent = nCount/sum(nCount) * 100) %>% 
  
  ggplot(aes(x = stroke, y = percent, fill = work_type)) +
  geom_bar(stat = 'identity')+
  labs(
    x = 'Stroke',
    y = 'Percent (%)'
  ) +
  theme(
    panel.border = element_blank(),
    panel.grid = element_blank()
  ) +
  geom_text(aes(label = paste(round(percent, 2), '%')), position = position_stack(vjust = 0.5))

The private work type contributes more to stroke compared to the others. On the other hand, we have the children, whose data is not fully represented. We will deal with this later in the tidying section.

Age

ggplot(data, aes(factor(stroke), age)) +
  geom_boxplot() + 
  #geom_jitter(alpha = 0.1) +
  xlab("stroke")

The likelihood of suffering a stroke increases with age. Because strokes are so uncommon in children, they barely register on any risk factor. The visualization makes sense, demonstrating that the risk of stroke at lower ages, between 0 and 35, is very low and close to impossible, but beyond that range the risk of stroke gets higher.

Smoking Status

data %>%
  ggplot(aes(x = smoking_status, fill = stroke)) +
  geom_bar()

Correlation

data %>% 
  select(where(is.numeric)) %>% # select the numeric columns
  cor(use = "complete.obs") %>% 
  corrplot(type = "lower", diag = TRUE)

From the correlation plot above, we can see that none of the numeric variables are strongly correlated with one another.

Tidying the data

After exploring the data, the dataset needs cleaning. We noticed the NAs in the bmi variable and the “Unknown” values in smoking_status.

data %>% filter(is.na(bmi)) %>%  #select the na values and group by stroke categories 
  group_by(stroke) %>% 
  count()
# A tibble: 2 × 2
# Groups:   stroke [2]
  stroke     n
  <fct>  <int>
1 yes       40
2 no       161

The missing bmi values fall mostly in the “no” class, and since “no” is already the majority class of the target, we can safely remove those rows.

data <- data %>%
  replace_with_na_all(condition = ~.x == "Unknown")

We just replaced “Unknown” with NA so that we can manipulate these values easily.

data <- data %>% filter(gender != "Other")
skim(data) %>%
  yank("numeric")

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1.00 43.23 22.61 0.08 25.00 45.00 61.00 82.00 ▅▆▇▇▆
avg_glucose_level 0 1.00 106.14 45.29 55.12 77.24 91.88 114.09 271.74 ▇▃▁▁▁
bmi 201 0.96 28.89 7.85 10.30 23.50 28.10 33.10 97.60 ▇▇▁▁▁

Now that bmi is the only variable left with missing data, let's investigate how to handle the “Unknown” values we had in the smoking status. We will use age to get more information about these “Unknown” records.

data %>% filter(work_type == "children") %>% 
  group_by(smoking_status) %>% 
  summarise(N = n(), 
            avg.age = mean(age), 
            max.age = max(age), 
            min.age = min(age))
# A tibble: 4 × 5
  smoking_status      N avg.age max.age min.age
  <fct>           <int>   <dbl>   <dbl>   <dbl>
1 formerly smoked    13   11.8       15   10   
2 never smoked       54   12.5       16   10   
3 smokes              2   11         12   10   
4 <NA>              618    6.23      16    0.08

Most of the “Unknown” smoking statuses belong to children, so we can impute them as “never smoked”. With this, let's replace all NAs in the smoking status with “never smoked”.

# Replace missing values in smoking_status with "never smoked"
data <- data %>%
  replace_na(list(smoking_status = "never smoked"))
data %>% filter(work_type == "children") %>% 
  group_by(smoking_status) %>% 
  summarise(N = n(), 
            avg.age = mean(age), 
            max.age = max(age), 
            min.age = min(age))
# A tibble: 3 × 5
  smoking_status      N avg.age max.age min.age
  <fct>           <int>   <dbl>   <dbl>   <dbl>
1 formerly smoked    13   11.8       15   10   
2 never smoked      672    6.73      16    0.08
3 smokes              2   11         12   10   

We now have an aggregate of 672 children in the “never smoked” category after imputing the “Unknown” values.

Missing data visual

vis_miss(data)

Only 0.4% of all values are missing, all of them in bmi, so we can drop the NA rows here.

data <- data %>% drop_na()
sum(is.na(data))

[1] 0

Yayy! Now that we have a clean dataset to use for training, we should check our response variable to see whether the dataset is imbalanced.

Target Variable

Lets check on our target variable

# calculate the proportion of stroke values
prop_stroke <- data %>% 
  count(stroke) %>% 
  mutate(prop_stroke = n / sum(n))

# view the proportion of stroke values
prop_stroke
# A tibble: 2 × 3
  stroke     n prop_stroke
  <fct>  <int>       <dbl>
1 yes      209      0.0426
2 no      4699      0.957 

Ouch! From the above, we can see that the target is highly imbalanced. We will deal with this by using a method called “up-sampling” to increase the number of observations in the “yes” class. We will handle this in the recipe applied to the training data.

Setting up Model

In this section, we will set up for our model building by splitting the data, building the recipe, and creating the 10-fold cross-validation.

Data Splitting, Imputation, Up-sampling and Recipe

set.seed(3431)
data_split <- initial_split(data,prop=0.70, strata = "stroke") #splitting the data

data_train <- training(data_split)
data_test <- testing(data_split)

dim(data_train)
[1] 3435   11

We divided the dataset into 70% for training and 30% for testing. To guarantee the same distribution of the target throughout the subsets, we stratified on our target variable, “stroke”. After dividing, we have 3435 rows for training and 1473 rows for testing.

dim(data_test)
[1] 1473   11

Building Recipe

One of the most important steps in preprocessing data is building a recipe. We will use all our variables as predictors; recall the introduction section, where we discussed the variables that contribute to a patient's stroke risk, which shows the importance of each variable in our dataset. In addition, we dummy-code the categorical variables, such as gender, smoking status, and work type, and normalize all the predictors. Also, since we are dealing with an imbalanced target variable, we use an up-sampling step to balance it.

data_recipe <- recipe(stroke ~ ., data = data_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors()) %>% 
  step_upsample(stroke, over_ratio = 0.5) 

Now we are sure of having 10 predictors and one outcome, which will be our target, and we have zero missing values in the training data. Let's build models to determine whether a patient will have a stroke or not.

K-Fold Cross Validation

We'll stratify our cross-validation on our target variable, stroke, and use 10 folds to perform stratified cross-validation.

data_folds <- vfold_cv(data_train, v = 10, strata = "stroke")

Model Building Workflow

The workflow setup is the same for all the models except KNN. The following steps were strictly followed:

  1. Set up the model by specifying the engine and the mode; we specified classification since this fits our goal so well.

  2. Set up the workflow of the model and add our stroke recipe (data_recipe). (We skip steps 3-5 for KNN, since it is a simpler model that doesn't need to be tuned.)

  3. Set up the tuning grid with the parameters we want tuned and the number of levels of tuning for each parameter.

  4. Yayy! Tune the model over the hyperparameter grid.

  5. Select the most accurate model from the tuning grid and finalize the workflow with it.

  6. Fit the model to the stroke training dataset.

  7. Save the model as an RDA file.

  8. Load the model and evaluate it on the stroke testing set.
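As a concrete sketch, steps 1-6 for the random forest could look like the code below. This is illustrative only: the grid ranges, the number of levels, and the ranger engine option are assumptions, not the exact settings used to produce the saved objects, though the names dt_class_wf, dtrf_class, and best_rf match the objects used later in the report.

```r
# Step 1: model spec -- random forest for classification, with the main
# hyperparameters (mtry, trees, min_n) marked for tuning.
rf_spec <- rand_forest(mtry = tune(), trees = tune(), min_n = tune()) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

# Step 2: workflow -- bundle the model spec with the stroke recipe.
dt_class_wf <- workflow() %>%
  add_model(rf_spec) %>%
  add_recipe(data_recipe)

# Step 3: tuning grid (illustrative ranges).
rf_grid <- grid_regular(mtry(range = c(1, 6)),
                        trees(range = c(200, 600)),
                        min_n(range = c(10, 20)),
                        levels = 5)

# Step 4: tune across the 10 cross-validation folds.
dtrf_class <- tune_grid(dt_class_wf, resamples = data_folds, grid = rf_grid)

# Step 5: pick the best parameter combination by ROC AUC; steps 6-7
# (finalizing, fitting on data_train, and saving) follow later in the report.
best_rf <- select_best(dtrf_class, metric = "roc_auc")
```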

Model Outcomes

The models take a long time to train, since we are working with a fairly large dataset, so each fitted model was saved to an RDA file. Now that we have the models, let's load them in and see their performance.

Loading the models

load("/Users/iderasalami/Documents/FINAL PROJECT REPORT/stroke_gb.rda")
load("/Users/iderasalami/Documents/FINAL PROJECT REPORT/stroke_rf.rda")
load("/Users/iderasalami/Documents/FINAL PROJECT REPORT/knn_fit.rda")
load("/Users/iderasalami/Documents/FINAL PROJECT REPORT/stroke_log.rda")
save(dtrf_class, file = "stroke_rf.rda")
save(lr_res, file = "stroke_log.rda")
save(knn_fit, file = "knn_fit.rda")
save(tune_bt_class, file = "stroke_gb.rda")

Interpreting the Model with Visuals

One of the most useful tools for visualizing the results of tuned models is the autoplot function in R. It visualizes the effect that changing certain parameters has on our metric of choice, roc_auc.

Logistic Regression

Logistic regression is a classification algorithm that uses the logistic function to model the relationship between the predictor variables and the probability of the binary outcome. The logistic function takes any input value and transforms it into a value between 0 and 1; an observation is classified as the positive class if its probability is greater than 0.5, and as the negative class otherwise. (Image source: AI Geek.)
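To make the logistic function and the 0.5 cutoff concrete, here is a tiny sketch using made-up linear-predictor values (not taken from the fitted model):

```r
# The logistic (sigmoid) function p = 1 / (1 + exp(-z)) is built into R as plogis().
z <- c(-2, 0, 1.5)                   # hypothetical linear-predictor values
p <- plogis(z)                       # probabilities in (0, 1): 0.119, 0.500, 0.818
pred <- ifelse(p > 0.5, "yes", "no") # classify with the 0.5 cutoff: "no", "no", "yes"
```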

As shown below, the logistic regression's roc_auc during tuning looks pretty good; let's look forward to its performance on the training set.

autoplot(lr_res)

Random Forest

Random Forest is a supervised learning algorithm that can be used for classification. It predicts the class label of a given input based on the majority vote of the classes predicted by multiple decision trees. The idea behind Random Forest is to grow multiple decision trees, each on a random subset of the original data and features, and then aggregate the predictions of these trees to make a final prediction. The key hyperparameters that control the performance and complexity of the model include mtry (the number of randomly selected predictors considered at each split), the number of trees, and the minimal node size. In our tuning results, an mtry of 6 performs well. This is better than logistic regression; let's see the results for Boosted Trees.

autoplot(dtrf_class) + theme_minimal()

Boosted Trees

Boosted Trees are an ensemble learning technique for classification problems that combines several decision trees into a more reliable and accurate model for predicting categorical outcomes. A first tree is fitted, its errors are assessed, and a second decision tree is fitted to the first tree's residuals. Each new tree is fitted to the residuals of the previous trees, and this procedure is repeated for a certain number of trees. The final prediction combines the predictions of all the trees in the ensemble, taking the majority vote for classification. Boosted Trees construct the trees sequentially, whereas Random Forests construct the trees independently and aggregate their predictions at the end. In the diagram below, the gradient-boosted performance looks better than the random forest.
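For comparison with the random forest, the boosted tree setup behind tune_bt_class could be sketched as follows. The xgboost engine, the tuned parameters, and the grid ranges are assumptions consistent with a standard tidymodels setup, not the exact code used:

```r
# Boosted trees: trees are grown sequentially, each one fitted to the
# residuals of the ensemble so far (handled internally by the engine).
bt_spec <- boost_tree(trees = tune(), tree_depth = tune(), learn_rate = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

bt_wf <- workflow() %>%
  add_model(bt_spec) %>%
  add_recipe(data_recipe)

# learn_rate() is tuned on the log10 scale, so c(-3, -1) means 0.001-0.1.
bt_grid <- grid_regular(trees(range = c(200, 600)),
                        tree_depth(range = c(2, 6)),
                        learn_rate(range = c(-3, -1)),
                        levels = 3)

tune_bt_class <- tune_grid(bt_wf, resamples = data_folds, grid = bt_grid)
```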


autoplot(tune_bt_class) + theme_minimal()

strokerf_fit <- finalize_workflow(dt_class_wf, best_rf)
strokerf_fit <- fit(strokerf_fit, data_train)

Choosing the best model

We will now predict with the KNN, Random Forest, Boosted Trees, and Logistic models on the training data and get the roc_auc of each model.

stroke_knn <- predict(knn_fit, new_data= data_train, type ="prob") %>%
  bind_cols(data_train %>%select(stroke)) %>%
   roc_auc(stroke, .pred_yes)

stroke_rf_auc<-augment(strokerf_fit, data_train) %>% 
  select(stroke, starts_with(".pred")) %>% 
  roc_auc(stroke, .pred_yes)

stroke_boosted_reg_auc<-augment(final_bt_model, data_train) %>% 
  select(stroke, starts_with(".pred")) %>% 
  roc_auc(stroke, .pred_yes)


stroke_loggr<-augment(stroke_pelog, data_train) %>% 
  select(stroke, starts_with(".pred")) %>% 
  roc_auc(stroke, .pred_yes)


rocAucTable <- bind_rows(stroke_knn, stroke_rf_auc, stroke_boosted_reg_auc, stroke_loggr) %>%
  mutate(.metric = c("KNN", "Random Forest", "Boosted Trees", "Logistic")) %>%
  dplyr::select(.metric, .estimate)
rocAucTable %>%
  arrange(.metric)
# A tibble: 4 × 2
  .metric       .estimate
  <chr>             <dbl>
1 Boosted Trees     0.918
2 KNN               0.987
3 Logistic          0.854
4 Random Forest     0.910

As shown in the tibble, KNN has the overall best ROC AUC score at 0.987, with Boosted Trees behind at 0.918. This is just the fit on the training set; let's explore the performance of all the models on the testing set.

Selecting the best model on the test set

stroke_log_test<-augment(stroke_pelog, data_test) %>% 
  select(stroke, starts_with(".pred")) %>% 
  roc_auc(stroke, .pred_yes)

test_knn <- predict(knn_fit, new_data= data_test, type ="prob") %>%
  bind_cols(data_test %>%select(stroke)) %>%
   roc_auc(stroke, .pred_yes)

stroke_model_test <- augment(strokerf_fit,data_test) %>% 
  select(stroke, starts_with(".pred"))%>%
  roc_auc(stroke, .pred_yes)


stroke_boosted_test<-augment(final_bt_model, data_test) %>% 
  select(stroke, starts_with(".pred")) %>% 
  roc_auc(stroke, .pred_yes)

rocAucTable <- bind_rows(test_knn, stroke_model_test, stroke_log_test, stroke_boosted_test) %>%
  mutate(.metric = c("KNN", "Random Forest", "Logistic Regression", "Boosted")) %>%
  dplyr::select(.metric, .estimate)
rocAucTable
# A tibble: 4 × 2
  .metric             .estimate
  <chr>                   <dbl>
1 KNN                     0.577
2 Random Forest           0.812
3 Logistic Regression     0.836
4 Boosted                 0.817


WOW! Surprisingly, Logistic Regression, which had the lowest ROC AUC on the training data, performed best on the testing data. The boosted tree, which was the second-best performer on the training set, also did well here, trailing by less than 0.02. KNN looks very different from the others; this could be because the model was not tuned, so it did not generalize as well as the logistic model.

Logistic Regression

augment(stroke_pelog, new_data = data_test) %>%
  roc_auc(stroke, .pred_yes) 
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.836

Let's explore our best model by checking variable importance. As we can see from the diagram below, age and the work-type dummies (self-employed and private) are among the most important variables in the model.

stroke_pelog %>% extract_fit_parsnip() %>%
  vip() +
  theme_minimal()

augment(stroke_pelog, new_data = data_test) %>%
  roc_curve(stroke, .pred_yes) %>%
  autoplot()

To visualize this result on the test set, we once again plot the ROC curve. We can see it bows up toward the top-left corner, which supports the ROC AUC score. Therefore, the logistic regression did better on the test set compared to the others.

Testing on New Patient Record

We will now test the best-performing model on a bad and a good patient record.

Testing on a Bad Patient Record

stroke_bad_test_example <- data.frame(
  age=70,
  gender= as.factor('Male'),
  hypertension = as.factor( 1),
  heart_disease= as.factor( 1),
  smoking_status =as.factor("smokes"),
  ever_married = as.factor("Yes"),
  work_type =as.factor( "Private"),
  residence_type= as.factor("Urban"),
  avg_glucose_level=180,
  bmi=89
)
predict(stroke_pelog, stroke_bad_test_example, type = "class")
# A tibble: 1 × 1
  .pred_class
  <fct>      
1 yes        

From the result above, we can see that the model correctly predicted “yes” based on the input data, taking into account the age, high glucose level, and other high-risk characteristics.

Testing on a Good Patient Record

stroke_good_test_example <- data.frame(
  age=25,
  gender= as.factor('Female'),
  hypertension = as.factor( 0),
  heart_disease= as.factor( 0),
  smoking_status =as.factor("never smoked"),
  ever_married = as.factor("Yes"),
  work_type =as.factor( "Private"),
  residence_type= as.factor("Rural"),
  avg_glucose_level=80,
  bmi=25
)
predict(stroke_pelog, stroke_good_test_example, type = "class")
# A tibble: 1 × 1
  .pred_class
  <fct>      
1 no         

As the result above shows, the model correctly predicted “no” based on the input data, taking into account the age, low glucose level, and other characteristics.

Conclusion

This project has worked through the stroke prediction problem from research, through data manipulation, to model development and evaluation. The KNN model performed best on our training set but fell behind on the test set, which allowed the tuned models, namely the logistic model, to outperform it. For future work, we could experiment with different training/testing proportions to improve performance. We might also examine additional models, such as SVMs and neural networks, for improved outcomes. Overall, the ROC AUC metrics yielded a good score, which enabled us to accurately predict the patient samples. Since this is a health-based use case, additional metrics could be employed to re-evaluate performance. We have reached the end of the project; don't forget, health is wealth.